-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Sonoma Data Scraper #57
Conversation
…into organization
The code is mostly done now. At this point, I'm just doing some clean-up and clarification, hence the draft status. |
It looks like it does. I'll try to fix those conflicts and linter errors tonight |
Looks like all the linter and merge problems have been fixed @Mr0grog @elaguerta |
covid19_sfbayarea/data/sonoma.py
Outdated
import dateutil.parser | ||
from typing import List, Dict, Union | ||
from bs4 import BeautifulSoup, element # type: ignore | ||
from format_error import FormatError # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is broken for me. If I run ./run_scraper_data.sh sonoma
or python scraper_data.py sonoma
, I get:
Traceback (most recent call last):
File "scraper_data.py", line 4, in <module>
from covid19_sfbayarea import data as data_scrapers
File "/Users/rbrackett/Dev/sfbrigade/data-covid19-sfbayarea/covid19_sfbayarea/data/__init__.py", line 4, in <module>
from . import sonoma
File "/Users/rbrackett/Dev/sfbrigade/data-covid19-sfbayarea/covid19_sfbayarea/data/sonoma.py", line 8, in <module>
from format_error import FormatError
ModuleNotFoundError: No module named 'format_error'
Does it work for you? Seems like it should be:
from format_error import FormatError # type: ignore | |
from .format_error import FormatError |
(It also shouldn’t need need the # type: ignore
comment.)
Also, while you’re here, maybe it’s worthwhile to make this a shared module up in covid19_sfbayarea/errors.py
instead of covid19_sfbayarea/data/format_error.py
? Some of the news scrapers also use an identical class, and we should really only have one implementation. Then this line would be:
from format_error import FormatError # type: ignore | |
from ..errors import FormatError |
In addition to the HTML tables, they are also using ArcGIS dashboards. I found some endpoints:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just calling out these notes from the last review, which didn’t seem to be addressed one way or the other. Not sure if you missed them or if you’re disagreeing (which is also fine; it just helps to be clear about it):
I did just miss those. Let me take a look |
Okay, I think I addressed those three points |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left one “food for thought” note inline, but this looks good to me overall. 👍
""" | ||
return [el.text for el in row.find_all(['th', 'td'])] | ||
|
||
def row_list_to_dict(row: List[str], headers: List[str]) -> UnformattedSeriesItem: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit: one of the important things about naming (or aliasing) types like this is to change how you conceptualize your values and functions (e.g. you shouldn’t be thinking of UnformattedSeriesItem
like a shortcut for Dict[str, str]
here; you should be thinking of it like a subclass of dict
— it should conceptually be its own separate thing).
So if you’re changing the return type to something named UnformattedSeriesItem
, it’s probably a good idea to change the function name to not talk about making a dict
and instead call it something like row_list_to_series_item
or something.
deaths = [] | ||
cumul_deaths = 0 | ||
|
||
rows = list(reversed(parse_table(cases_tag))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, forgot to comment on this in the review: I don’t think there’s any reason to convert this to a list, since you’re only iterating over it once and not returning it:
rows = list(reversed(parse_table(cases_tag))) | |
rows = reversed(parse_table(cases_tag)) |
No description provided.